Corpus-based Japanese morphological analysis

Authors

Masayuki Asahara

Yuji Matsumoto

Shunsuke Uemura

Kiyohiro Shikano

Kentaro Inui

Clyde P. Kruskal

Abstract

The goal of this study is to improve corpus-based Japanese morphological analysis, which consists of word segmentation and part-of-speech (hereafter POS) tagging. We divide the problem of Japanese morphological analysis into three subproblems: models for known words, models for unknown words, and a corpus maintenance scheme. First, we discuss Markov model-based approaches to known-word processing. We point out phenomena that are difficult to analyze with a simple Markov model and that require special treatment. To handle them, we introduce three extensions to the Markov model: lexicalized POS, position-wise grouping, and selective trigrams. Second, we discuss unknown-word processing. We propose a new offline model for unknown words based on a pattern recognition approach. Unknown words are first extracted from the text by chunking; the POS of each extracted word is then estimated by an approach similar to word sense disambiguation. Third, we discuss a maintenance scheme for word-segmented and POS-tagged corpora. Corpus maintenance is a crucial issue for corpus-based models. We propose using a relational database to keep the corpora consistent. The relational database enables synchronized transactions between the lexicon and the corpora, which reduces the risk of discrepancies in the corpus. As side issues, we discuss Japanese named entity extraction and filler filtering. Japanese named entity extraction is an application in information extraction, for which we propose two extensions. One is a character-based chunking method that solves the word boundary discrepancy problem. The other is the use of point-wise n-best answers from a Japanese morphological analyzer, which makes the model robust. The proposed method achieves the best accuracy compared with preceding work. Filler filtering is a preprocessing step for Japanese morphological analysis. Many fillers and disfluencies appear in transcriptions of spoken language, and they cause errors in Japanese morphological analysis. We introduce a pattern recognition method for filtering fillers and disfluencies from transcriptions.
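To make the known-word setting concrete, the following is a minimal sketch of joint word segmentation and POS tagging with a plain bigram Markov model, solved by a Viterbi search over a word lattice. It is not the authors' implementation and includes none of the three proposed extensions. The lexicon, the tag set, and all costs (read as negative log probabilities) are hypothetical values chosen for illustration, as is the function name `analyze`.

```python
import math

# Hypothetical toy lexicon: surface form -> list of (POS tag, word cost).
# Costs stand in for negative log emission probabilities.
LEXICON = {
    "東": [("noun", 3.0)],
    "東京": [("noun", 2.0)],
    "京都": [("noun", 2.0)],
    "都": [("suffix", 2.0)],
    "に": [("particle", 1.0)],
    "住む": [("verb", 2.0)],
}

# Hypothetical bigram connection costs between adjacent POS tags.
# Pairs that are missing are treated as impossible (infinite cost).
TRANSITION = {
    ("BOS", "noun"): 1.0,
    ("noun", "noun"): 3.0,
    ("noun", "suffix"): 1.0,
    ("noun", "particle"): 1.0,
    ("suffix", "particle"): 1.0,
    ("particle", "verb"): 1.0,
    ("verb", "EOS"): 1.0,
}


def analyze(text):
    """Return (list of (surface, POS), total cost) for the cheapest path,
    or (None, inf) if the lexicon cannot cover the text."""
    n = len(text)
    # best[i][pos] = (cost, backpointer) of the cheapest partial path
    # whose last word ends at character offset i and carries tag pos.
    best = [dict() for _ in range(n + 1)]
    best[0]["BOS"] = (0.0, None)

    for start in range(n):
        if not best[start]:
            continue  # no word ends here, so no word can start here
        for end in range(start + 1, n + 1):
            surface = text[start:end]
            for pos, word_cost in LEXICON.get(surface, []):
                for prev_pos, (prev_cost, _) in best[start].items():
                    trans = TRANSITION.get((prev_pos, pos), math.inf)
                    cost = prev_cost + trans + word_cost
                    if cost < best[end].get(pos, (math.inf, None))[0]:
                        best[end][pos] = (cost, (start, prev_pos, surface))

    # Close the sentence with the end-of-sentence transition.
    final_cost, final_pos = math.inf, None
    for pos, (cost, _) in best[n].items():
        total = cost + TRANSITION.get((pos, "EOS"), math.inf)
        if total < final_cost:
            final_cost, final_pos = total, pos
    if final_pos is None:
        return None, math.inf

    # Follow backpointers from the end of the text to BOS.
    words = []
    i, pos = n, final_pos
    while pos != "BOS":
        _, (start, prev_pos, surface) = best[i][pos]
        words.append((surface, pos))
        i, pos = start, prev_pos
    words.reverse()
    return words, final_cost


if __name__ == "__main__":
    print(analyze("東京都に住む"))
```

With these toy costs the search prefers 東京/都 over 東/京都, because the noun-to-suffix connection is cheaper than the noun-to-noun one. A string that the lexicon cannot cover yields no path at all, which is the gap that unknown-word processing has to fill.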

Similar resources

UniDic for Early Middle Japanese: a Dictionary for Morphological Analysis of Classical Japanese

In order to construct an annotated diachronic corpus of Japanese, we propose to create a new dictionary for morphological analysis of Early Middle Japanese (Classical Japanese) based on UniDic, a dictionary for Contemporary Japanese. Differences between the Early Middle Japanese and Contemporary Japanese, which prevent a naïve adaptation of UniDic to Early Middle Japanese, are found at the leve...

Applying Conditional Random Fields to Japanese Morphological Analysis

This paper presents Japanese morphological analysis based on conditional random fields (CRFs). Previous work in CRFs assumed that observation sequence (word) boundaries were fixed. However, word boundaries are not clear in Japanese, and hence a straightforward application of CRFs is not possible. We show how CRFs can be applied to situations where word boundary ambiguity exists. CRFs offer a so...

A Proper Approach to Japanese Morphological Analysis: Dictionary, Model, and Evaluation

In this paper, we discuss lemma identification in Japanese morphological analysis, which is crucial for a proper formulation of morphological analysis that benefits not only NLP researchers but also corpus linguists. Since Japanese words often have variation in orthography and the vocabulary of Japanese consists of words of several different origins, it sometimes happens that more than one writ...

Automatic Labeling of Voiced Consonants for Morphological Analysis of Modern Japanese Literature

Since the present-day Japanese use of the voiced consonant mark was established in the Meiji Era, modern Japanese literary text written in the Meiji Era often lacks compulsory voiced consonant marks. This degrades the performance of morphological analyzers that use an ordinary dictionary. In this paper, we propose an approach for automatic labeling of voiced consonant marks for modern literary Japane...

Detecting Sentence Boundaries in Japanese Speech Transcriptions Using a Morphological Analyzer

We present a method to automatically detect sentence boundaries (SBs) in Japanese speech transcriptions. Our method uses a Japanese morphological analyzer that is based on a cost calculation and selects as the best result the one with the minimum cost. The idea behind using a morphological analyzer to identify candidates for SBs is that the analyzer outputs lower costs for better sequences of mo...

Publication date: 2003
